Introduction

This notebook is to explore Canadian SARS-CoV-2 genomic and epidemiological data, for discussion with pillar 6’s team and for sharing with collaborators. These analyses can spur further research within or across pillars, be used for reports (or data dashboards), given to the science communication pillar for public dissemination, and the code can be repackaged to give to public health authorities for their internal use.

Canadian genomic and epidemiological data will be regularly pulled from various sources to keep these analyses up-to-date.

## 1. LOAD processed metadata of Canadian sequences (with latest pangolin, division, and full seq IDs)
#Download metadata from gisaid, put the date here:
gisaiddate="2021_12_10"
#date=2021_12_10
#tar -axf metadata_tsv_$date.tar.xz metadata.tsv -O | tr ' ' '_'  | sed 's/\t\t/\tNA\t/g' | sed 's/\t\t/\tNA\t/g' | sed 's/\t$/\tNA/g' | awk 'NR==1 || substr($1,9,6)=="Canada" && $8=="Human"' > metadata_CANall_$date.csv &
#zip CoVaRRNet_pillar6notebook/data_needed/metadata_CANall_$date.zip metadata_CANall_$date.csv

metaCANall <- read.csv(unz(paste("./data_needed/metadata_CANall_",gisaiddate,".zip",sep=""),paste("metadata_CANall_",gisaiddate,".csv",sep="")),sep="\t")
metaCANall$Collection_date <- as.Date(metaCANall$Collection_date)
#max(metaCANall$Collection.date) - min(metaCANall$Collection.date) #time diff: 580 days

#make a pango.group column

metaCANall$pango.group <- metaCANall$Variant
metaCANall$pango.group[is.na(metaCANall$pango.group)]<-"other"
metaCANall$pango.group[metaCANall$pango.group == "VOC_Alpha_202012/01_GRY_(B.1.1.7+Q.x)_first_detected_in_the_UK"]<-"Alpha"
metaCANall$pango.group[metaCANall$pango.group == "VOC_Beta_GH/501Y.V2_(B.1.351+B.1.351.2+B.1.351.3)_first_detected_in_South_Africa" ]<-"Beta"
metaCANall$pango.group[metaCANall$pango.group == "VOC_Gamma_GR/501Y.V3_(P.1+P.1.x)_first_detected_in_Brazil/Japan"   ]<-"Gamma"
metaCANall$pango.group[metaCANall$pango.group == "VOC_Delta_GK/478K.V1_(B.1.617.2+AY.x)_first_detected_in_India"   ]<-"Delta"
metaCANall$pango.group[metaCANall$Pango_lineage == "AY.4" ]<-"Delta Plus"
metaCANall$pango.group[metaCANall$pango.group == "VOI_Lambda_GR/452Q.V1_(C.37+C.37.1)_first_detected_in_Peru"  ]<-"Lambda"
metaCANall$pango.group[metaCANall$pango.group == "VOI_Mu_GH_(B.1.621+B.1.621.1)_first_detected_in_Colombia" ]<-"Mu"
metaCANall$pango.group[metaCANall$pango.group == "VOI_Mu_GH_(B.1.621+B.1.621.1)_first_detected_in_Colombia" ]<-"Mu"
metaCANall$pango.group[str_detect(metaCANall$pango.group,"V") ]<-"other"
metaCANall$pango.group[metaCANall$Pango_lineage == "A.23.1" ]<-"A.23.1"
metaCANall$pango.group[metaCANall$Pango_lineage == "B.1.438.1" ]<-"B.1.438.1"
metaCANall$pango.group[grepl("B\\.1\\.1\\.529|BA\\.", metaCANall$Pango_lineage)] <- "Omicron"

## 2. LOAD epidemiological data (PHAC)



epidataCANall <- read.csv(url("https://health-infobase.canada.ca/src/data/covidLive/covid19-download.csv"))
#from: https://health-infobase.canada.ca/covid-19/epidemiological-summary-covid-19-cases.html?stat=num&measure=total&map=pt#a2
epidate = tail(epidataCANall,1)$date
epidataAlberta <- epidataCANall[ which(epidataCANall$prnameFR=='Alberta'), ]
#print(epidataAlberta$numtoday)

Snapshot: SARS-CoV-2 in Canada

Sequencing coverage and sharing

Note that in this demo almost all of Ontario’s sequences were not included due to the lack of complete dates in GISAID. Moving forward, using VirusSeq Portal data, we should have better coverage in Ontario.

Variants in Canada

Sequence counts for all Canadian data in GISAID, up to 2021_12_10, shows that by end of summer Alpha and Gamma were still the most sequenced variants. Note that some of the “other” category include some Delta sublineages (AYs) but overall, from the beginning of the pandemic to the fall of 2021, Canadian sequences were mostly of the wildtype lineages (pre-VOCs).

metaCANall <- filter(metaCANall, !is.na(metaCANall$Collection_date))

# --- Histogram plot: counts per variant of all canadian data -------------


listpango <- unique(metaCANall$pango.group)
print(listpango)
##  [1] "other"      "Alpha"      "B.1.438.1"  "Beta"       "A.23.1"    
##  [6] "Gamma"      "Delta"      "Lambda"     "Delta Plus" "Mu"        
## [11] "Omicron"
listpango <- listpango[order(listpango)]
getpal <- colorRampPalette(brewer.pal(9, "Paired")) #"Set3
pal <- getpal(length(listpango))
names(pal) <- listpango


#------------- counts over time (by week)

df1 <- metaCANall %>% group_by(pango.group) %>% group_by(week = cut(Collection_date, "week")) #adds week coloum
dfcount <- df1 %>% group_by(week) %>% count(pango.group)


#plot
ggplot(dfcount, aes(x=as.Date(week), y= n, fill=pango.group))+geom_bar(stat = "identity")+
  scale_fill_manual(breaks=listpango, values = pal)+ ylab("")+xlab("")+
  scale_x_date(date_breaks = "14 day")+
  theme_classic()+theme(axis.text.x = element_text(angle=90, vjust=0.1,hjust=0.1))

#------------- proportion over time (by week)

# dfprop <- dfcount %>% group_by(pango.group, week) %>% 
#   summarise(n = n()) %>% 
#   mutate(freq = prop.table(n))#n / sum(n))

dfcount %>% mutate(freq = prop.table(n)) -> dfprop

#plot
ggplot(dfprop, aes(x= as.Date(week), y=freq, fill=pango.group))+geom_bar(stat = "identity")+
  scale_fill_manual(breaks=listpango, values = pal)+ ylab("")+xlab("")+
  scale_x_date(date_breaks = "14 day")+
  theme_classic()+theme(axis.text.x = element_text(angle=90, vjust=0.1,hjust=0.1))

There are two PANGO lineages that have a Canadian origin and have predominately spread within Canada (with some exportations internationally): B.1.438.1 and B.1.1.176.

Other lineages of Canadian interest:

  • A.2.5.2 - an A lineage (clade 19B) that spread in Quebec, involved in several outbreaks, before Delta arrived
  • B.1.2 - an American (USA) lineage that spread well in Canada
  • B.1.160 - an European lineages that spread well in Canada

Canadian trees

Down-sampled Canadian SARS-CoV-2 genomes. Taken from GISAID Sept. 12th, 2021. Alignment GISAID, ML tree in IQ-tree. This tree was generated using TreeTime and visualized in ggtree.

This tree shows several features of VOC spread in Canada:

  • Alpha spread across most of Canada
  • Gamma was mostly located in BC
  • Alberta represents most of the Delta cases
  • Expansion of Alpha and Gamma coincide with a decrease in detection of wildtype variants
  • Delta is the bulk of what was being sequenced at the end of the summer

These last two points are suggestive of strain (variant) replacement, i.e. competitive exclusion, but more detailed analyses would be required to clarify this.

Divergence tree from TreeTime run, visualized in ggtree.

Evolution and growth rates of SARS-CoV-2 in Canada

There are various methods to investigate changes in evolutionary rates of VOC/VOIs and compare their relative fitness in an epidemiological context.

For example, root-to-tip plots give estimates of substitution rates. Clusters with stronger positive slopes than the average SARS-CoV-2 dataset, are an indication that they are accumlating mutations at a faster pace, or clusters that jump up above the average could indicate a mutational jump (simlar to what we saw with Alpha when it first appeared in the UK).

Maximum likelihood tree (IQ-TREE) processed with root-to-tip regression and plotting in R.
rooted <- read.tree('./data_needed/msa_0908_CANall_downsamp10perday_1.rtt.nwk')
get.dates <- function(phy, delimiter='_', pos=-1, format='%Y-%m-%d') {
  dt <- sapply(phy$tip.label, function(x) {
    tokens <- strsplit(x, delimiter)[[1]]
    if (pos < 0) { return(tokens[length(tokens)+pos+1]) }
    else { return(tokens[pos]) }
  })
  as.Date(dt, format=format)
}
tip.dates <- get.dates(rooted, pos=-2)

# total branch length from root to each tip
div <- node.depth.edgelength(rooted)[1:Ntip(rooted)]

# match tips to metadata to retrieve PANGO lineage assignments
accno <- gsub(".+_(EPI_ISL_[0-9]+)_.+", "\\1", rooted$tip.label)
index <- match(accno, metaCANall$Accession_ID)
pg <- metaCANall$pango.group[index]

blobs <- function(x, y, col, cex=1) {
  points(x, y, pch=21, cex=cex)
  points(x, y, bg=col, col=rgb(0,0,0,0), pch=21, cex=cex)
}
dlines <- function(x, y, col) {
  lines(x, y, lwd=2.5)
  lines(x, y, col=col)
}

vocs <- c('Alpha', 'Beta', 'Gamma', 'Delta', 'Omicron')
pal <- hcl.colors(n=length(vocs), palette="Sunset")

par(mar=c(5,5,0,1))
plot(tip.dates, div, type='n', las=1, cex.axis=0.6, cex.lab=0.7, bty='n',
     xaxt='n', xlab="Sample collection date", ylab="Divergence from root")
xx <- floor_date(seq(min(tip.dates), max(tip.dates), length.out=5), unit="months")
axis(side=1, at=xx, label=format(xx, "%b %Y"), cex.axis=0.6)

blobs(tip.dates[pg=='other'], div[pg=='other'], col='grey', cex=0.8)
fit0 <- rlm(div[pg=='other'] ~ tip.dates[pg=='other'])
abline(fit0, col='gray50')
fits <- list(other=fit0)

for (i in 1:length(vocs)) {
  variant <- vocs[i]
  x <- tip.dates[pg==variant]
  if (all(is.na(x))) next
  y <- div[pg==variant]
  blobs(x, y, col=pal[i], cex=0.8)
  fit <- rlm(y ~ x)
  dlines(fit$x[,2], predict(fit), col=pal[i])
  fits[[variant]] <- fit
}

legend(x=min(tip.dates), y=max(div), legend=vocs, pch=21, pt.bg=pal, bty='n', cex=0.8)

ci <- lapply(fits, confint.default)
kable(data.frame(
  n=sapply(fits, function(f) nrow(f$x)),                            
  est=29970*sapply(fits, function(f) f$coef[2]),
  lower.95=29970*sapply(ci, function(f) f[2,1]),
  upper.95=29970*sapply(ci, function(f) f[2,2])
), 
col.names = c("Number of genomes", "Estimate", "Lower 95% CI", "Upper 95% CI"),
format="html", align="rrrr", caption="Molecular clock rates (subs/genome/day)",
format.args = list(scientific = FALSE), digits=4, table.attr = "style='width:100%;'")
Molecular clock rates (subs/genome/day)
Number of genomes Estimate Lower 95% CI Upper 95% CI
other 3284 0.0478 0.0464 0.0492
Alpha 628 0.0377 0.0312 0.0442
Beta 16 0.0126 -0.0046 0.0299
Gamma 289 0.0597 0.0508 0.0686
Delta 251 0.0600 0.0443 0.0756

Overall, the evolutionary rate among genomes sampled in Canada (0.0678102 subs/genome/day, grey line) is close to the global average of 0.0674941 subs/genome/day. Compared to other lineages sampled in Canada, variant of concern Alpha (B.1.1.7) exhibited a slightly but significantly lower rate of evolution. Both variants of concern Gamma (P.1) and Delta (B.1.617.2) exhibited higher rates, although only Gamma was significantly higher.

Add here:

  • growth rates (Sally’s code)
  • dN/dS (by variant and by gene/domains)
  • Tajima’s D
  • other? (e.g. PyR0, nnet R, ?)
  • Delta and its sublineages (specific plots as above, also delta w/+w/o E484K in Canada/PTs)

Are there any BEAST analyses we’d like to do? e.g. infer R0, serial interval, etc for the different Delta sublineages (might have to restrict this geographically to specific PTs with enough coverage)

List of useful tools

Collect a list of bioinformatics, phylogenetic, and modelling tools that are useful for SARS-CoV-2 analyses:

Data sources

  • GISAID
  • VirusSeq Portal
  • PHAC
  • various PT PH websites (e.g. INSPQ)

Session info

The version numbers of all packages in the current environment as well as information about the R install is reported below.

Hide

Show

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] MASS_7.3-53.1      viridis_0.6.0      viridisLite_0.4.0  colorspace_2.0-0  
##  [5] RColorBrewer_1.1-2 kableExtra_1.3.4   gridExtra_2.3      ggbeeswarm_0.6.0  
##  [9] DT_0.18            cowplot_1.1.1      ggtree_2.4.1       phytools_0.7-70   
## [13] maps_3.3.0         phangorn_2.6.3     tidytree_0.3.3     phylotools_0.2.2  
## [17] ape_5.4-1          treeio_1.14.3      lubridate_1.7.10   reticulate_1.23   
## [21] knitr_1.32         forcats_0.5.1      stringr_1.4.0      dplyr_1.0.5       
## [25] purrr_0.3.4        readr_1.4.0        tidyr_1.1.3        tibble_3.1.0      
## [29] ggplot2_3.3.3      tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##  [1] ellipsis_0.3.1          fs_1.5.0                aplot_0.0.6            
##  [4] rstudioapi_0.13         farver_2.1.0            fansi_0.4.2            
##  [7] xml2_1.3.2              codetools_0.2-18        mnormt_2.0.2           
## [10] jsonlite_1.7.2          broom_0.7.6             dbplyr_2.1.1           
## [13] png_0.1-7               BiocManager_1.30.12     compiler_4.0.2         
## [16] httr_1.4.2              rvcheck_0.1.8           backports_1.2.1        
## [19] assertthat_0.2.1        Matrix_1.3-2            lazyeval_0.2.2         
## [22] cli_2.4.0               htmltools_0.5.1.1       tools_4.0.2            
## [25] igraph_1.2.6            coda_0.19-4             gtable_0.3.0           
## [28] glue_1.4.2              clusterGeneration_1.3.7 fastmatch_1.1-0        
## [31] Rcpp_1.0.6              cellranger_1.1.0        jquerylib_0.1.3        
## [34] vctrs_0.3.7             svglite_2.0.0           nlme_3.1-152           
## [37] xfun_0.22               rvest_1.0.0             lifecycle_1.0.0        
## [40] gtools_3.8.2            scales_1.1.1            hms_1.0.0              
## [43] parallel_4.0.2          expm_0.999-6            yaml_2.2.1             
## [46] sass_0.3.1              stringi_1.5.3           highr_0.9              
## [49] plotrix_3.8-1           rlang_0.4.10            pkgconfig_2.0.3        
## [52] systemfonts_1.0.2       evaluate_0.14           lattice_0.20-41        
## [55] labeling_0.4.2          patchwork_1.1.1         htmlwidgets_1.5.3      
## [58] tidyselect_1.1.0        magrittr_2.0.1          R6_2.5.0               
## [61] generics_0.1.0          combinat_0.0-8          DBI_1.1.1              
## [64] pillar_1.6.0            haven_2.4.0             withr_2.4.1            
## [67] scatterplot3d_0.3-41    modelr_0.1.8            crayon_1.4.1           
## [70] utf8_1.2.1              tmvnsim_1.0-2           rmarkdown_2.7          
## [73] grid_4.0.2              readxl_1.3.1            reprex_2.0.0           
## [76] digest_0.6.27           webshot_0.5.2           numDeriv_2016.8-1.1    
## [79] munsell_0.5.0           beeswarm_0.3.1          vipor_0.4.5            
## [82] bslib_0.2.4             quadprog_1.5-8